|
Mass data clean system based on regular expression
CHANG Zheng, LYU Yong
Journal of Computer Applications
2019, 39 (10):
2942-2947.
DOI: 10.11772/j.issn.1001-9081.2019030492
Based on the current mainstream Extract Transform Load (ETL) tools for data and the disadvantages of some applications in restricted environments, a Regular Expression Mass-data Cleaning System (REMCS) was proposed for the specific requirements in the restricted application scenarios. Firstly, the data features of six typical problems including ultra-long error data, batch fusion of data source files, automatic sorting of data source files, were discovered. And the appropriate regular expressions and pre-processing algorithms were put forward according to the data features. Then, data pre-processing was completed by using the algorithm model to remove the errors in data. At the same time, the system logical structure, common problems, and corresponding solutions, and code implementation scheme of REMCS were described in detail. Finally, the comparison experiments of several common data processing problems were carried out with the following four aspects:the compatible data source file formats, the soveble problem types, the problem processing time and the data processing limit value. Compared with the traditional ETL tools, REMCS can support nine typical file formats such as csv format, json format, dump format, and can address all six common problems with shorter processing time and larger supportable data limit value. Experimental results show that REMCS has better applicability and high accuracy for common data processing problems in restricted application scenarios.
Reference |
Related Articles |
Metrics
|
|